2025-05-31-00-00
SWE-bench Goes Live!
Abstract
arXiv:2505.23419v1 Announce Type: cross Abstract: The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present \textbf{SWE-bench-Live}, a \textit{live-updatable} benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is \method, an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.
摘要
问题修复任务(即模型生成补丁以修复现实世界中的错误)已成为评估大语言模型(LLM)能力的关键基准。尽管SWE-bench及其变体在该领域已成为标准,但它们存在关键局限性:自初始发布以来未进行更新,覆盖的代码库范围狭窄,且严重依赖人工进行实例构建和环境设置。这些因素阻碍了可扩展性,并带来了过拟合和数据污染的风险。在本研究中,我们提出了\textbf{SWE-bench-Live},一个旨在克服这些挑战的\textit{可实时更新}的基准。我们的初始版本包含1,319个任务,源自2024年以来创建的GitHub真实问题,涵盖93个代码库。每个任务均配有专用的Docker镜像以确保可复现的执行。我们基准的核心是\method,一个自动化筛选流程,它简化了从实例创建到环境设置的整个过程,消除了人工瓶颈,实现了可扩展性和持续更新。我们在SWE-bench-Live上评估了一系列最先进的智能体框架和LLM,揭示了与SWE-bench等静态基准相比存在的显著性能差距,即使在受控评估条件下也是如此。为了更好地理解这一差异,我们从代码库来源、问题时效性和任务难度等方面进行了详细分析。通过提供一个基于实时代码库活动的新颖、多样且可执行的基准,SWE-bench-Live为动态、真实的软件开发环境中LLM和智能体的严格且抗污染的评估提供了便利。